False, we would need only 2 dummy variables, not 3. If we included a third, it would be a linear combination of the intercept and the first two columns. This is known as the dummy variable trap.
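A quick sketch of the trap in R: `model.matrix()` drops the reference level automatically, and the omitted dummy is fully determined by the others (toy factor, not the Cars data).

```r
# Dummy coding for a 3-level factor: model.matrix() emits only 2 dummy
# columns; the reference level "a" is absorbed into the intercept.
f <- factor(c("a", "b", "c", "a", "b", "c"))
X <- model.matrix(~ f)
colnames(X)  # "(Intercept)" "fb" "fc"
# A third dummy would be redundant: it equals 1 - fb - fc
all((f == "a") == (1 - X[, "fb"] - X[, "fc"]))  # TRUE
```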
False, we do not prefer such high order models because they have high sampling variability and risk overfitting the data.
True, since the terms are multiplied, the scale of the interaction term depends on the scales of the variables being multiplied.
False, separate models, while more flexible, are not as efficient and have more sampling variability than a joint model.
# Libraries
library(MASS)
library(ggplot2)
library(plotly)
library(GGally)
cars <- read.csv("Cars.csv")
cars
# Fixing horsepower column
cars$horsepower <- as.numeric(cars$horsepower)
## Warning: NAs introduced by coercion
# Find index of NA rows (which(is.na(cars)) would return cell indices,
# not row numbers, so index the column with the NAs directly)
NArows <- which(is.na(cars$horsepower))
# View rows
cars[NArows,]
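A toy illustration (not the Cars data) of why the column must be indexed directly: `which(is.na())` on a whole data frame returns cell positions in column-major order, while `complete.cases()` gives the row numbers.

```r
# which(is.na(df)) indexes cells of the is.na matrix, not rows
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
which(is.na(df))           # cell positions: 2, 6
which(!complete.cases(df)) # rows containing any NA: 2, 3
```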
str(cars)
## 'data.frame': 397 obs. of 7 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ country.code: int 1 1 1 1 1 1 1 1 1 1 ...
We should treat the variables “cylinders” and “country.code” as categorical variables. The rest of the variables should be treated as quantitative variables.
cars$cylinders <- as.factor(cars$cylinders)
cars$country.code <- as.factor(cars$country.code)
# MPG
ggplot(data = cars, aes(x = mpg)) + geom_histogram(bins = 15)
#Displacement
ggplot(data = cars, aes(x = displacement)) + geom_histogram(bins = 10)
# Horsepower
ggplot(data = cars, aes(x = horsepower))+ geom_histogram(bins = 20)
## Warning: Removed 5 rows containing non-finite values (stat_bin).
# Weight
ggplot(data = cars, aes(x = weight)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Acceleration
ggplot(data = cars, aes(x = acceleration)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the histograms we notice the following: mpg is slightly right skewed, displacement is heavily right skewed, horsepower and weight are moderately right skewed, and acceleration is approximately normal.
ggpairs(cars[,c(1, 3, 4, 5, 6)], title = "Correlogram of Cars Data")
We can see from the scatter plot matrix that we face multicollinearity among the predictors in this data. The variables mpg, displacement, horsepower, and weight are all highly correlated with one another; the strongest relationship is between displacement and weight, with a correlation of \(0.933\).
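The visual impression can be checked numerically with `cor()`. This is a sketch on the built-in `mtcars` data as a stand-in for the Cars data (its `mpg`/`disp`/`hp`/`wt` columns mirror mpg, displacement, horsepower, and weight).

```r
# Correlation matrix of the quantitative columns
num_cols <- mtcars[, c("mpg", "disp", "hp", "wt")]
cc <- cor(num_cols)
round(cc, 3)
# Flag predictor pairs whose correlation exceeds 0.8 in absolute value
which(abs(cc) > 0.8 & upper.tri(cc), arr.ind = TRUE)
```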
library(plotly)
fig1 <-
plot_ly(
data = data.frame(table(cars$cylinders)),
labels = ~ Var1,
values = ~ Freq,
type = 'pie'
)
fig1 <-
fig1 %>% layout(
title = "Number of Cylinders",
xaxis = list(
showgrid = FALSE,
zeroline = FALSE,
showticklabels = FALSE
),
yaxis = list(
showgrid = FALSE,
zeroline = FALSE,
showticklabels = FALSE
)
)
fig1
fig2 <-
plot_ly(
data = data.frame(table(cars$country.code)),
labels = ~ Var1,
values = ~ Freq,
type = 'pie'
)
fig2 <-
fig2 %>% layout(
title = "Pie Chart of Country Code",
xaxis = list(
showgrid = FALSE,
zeroline = FALSE,
showticklabels = FALSE
),
yaxis = list(
showgrid = FALSE,
zeroline = FALSE,
showticklabels = FALSE
)
)
fig2
ggplot(data = cars, aes(x = cylinders, y = mpg, fill = country.code)) + geom_boxplot()
We can see that the only cylinder options available in all three countries were 4 cylinders and 6 cylinders; some countries had only one cylinder option. Four-cylinder cars performed, on average, the best in miles per gallon, regardless of the country, while eight-cylinder cars had the worst mpg performance. Four-cylinder cars in country code 3 performed the best of all categories on average.
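The boxplot impressions can be backed up with group means. A sketch on `mtcars` (its `cyl` column stands in for the cylinders factor in the Cars data):

```r
# Mean mpg by cylinder count: 4-cylinder cars average the highest mpg,
# 8-cylinder the lowest
means <- tapply(mtcars$mpg, mtcars$cyl, mean)
round(means, 2)
```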
ggplot(data = cars, aes(x = mpg)) + geom_histogram(bins = 20)
cars.Model <- lm(mpg ~ cylinders + horsepower + weight, data = cars)
bc <- boxcox(cars.Model)
From the histogram and the Box-Cox plot, we can see that mpg is moderately right skewed; because of this, we will apply the transformation chosen by the Box-Cox procedure.
# Find best lambda
lambda <- bc$x[which.max(bc$y)]
# make new data column
cars$mpgTrans <- (cars$mpg^lambda - 1)/ lambda
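The transformation applied above, \((y^\lambda - 1)/\lambda\), has a closed-form inverse, which is useful for mapping fitted values back to the mpg scale. A self-contained sketch (`lam = 0.5` below is a placeholder value, not the lambda selected by `boxcox()` above):

```r
# Box-Cox transform z = (y^lam - 1)/lam and its inverse; the lam = 0
# case is the log transform by continuity
bc_transform <- function(y, lam) {
  if (abs(lam) < 1e-8) log(y) else (y^lam - 1) / lam
}
bc_inverse <- function(z, lam) {
  if (abs(lam) < 1e-8) exp(z) else (lam * z + 1)^(1 / lam)
}
lam <- 0.5  # placeholder; use bc$x[which.max(bc$y)] in practice
bc_inverse(bc_transform(30, lam), lam)  # recovers 30
```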
cars.Model.T <- lm(mpgTrans ~ cylinders + horsepower + weight, data = cars)
summary(cars.Model.T)
##
## Call:
## lm(formula = mpgTrans ~ cylinders + horsepower + weight, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.125927 -0.027564 -0.001233 0.031690 0.161618
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.018e+00 2.733e-02 73.833 < 2e-16 ***
## cylinders4 7.577e-02 2.274e-02 3.332 0.000945 ***
## cylinders5 1.049e-01 3.474e-02 3.018 0.002712 **
## cylinders6 3.811e-02 2.354e-02 1.619 0.106239
## cylinders8 4.402e-02 2.521e-02 1.746 0.081652 .
## horsepower -8.508e-04 1.345e-04 -6.326 6.98e-10 ***
## weight -6.122e-05 6.939e-06 -8.822 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04474 on 385 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.8223, Adjusted R-squared: 0.8196
## F-statistic: 297 on 6 and 385 DF, p-value: < 2.2e-16
From the F test, we can see a very small p-value, indicating that at least one of the coefficients is nonzero. Both weight and horsepower are highly significant. The indicator variables for 4 and 5 cylinders are significant, while those for 6 and 8 cylinders are not. The \(R^2\) and \(R^2_{adj}\) are 0.8223 and 0.8196 respectively, indicating a strong fit. The model is adequate.
\[ \hat{Y_i} = (2.018 + 0.07577) - 0.00085X_{hp} - 0.000061X_{wt} \]

### d.)
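The fitted equation can be checked by hand using the coefficients from the summary above; for a 4-cylinder car with 100 horsepower and weight 3000, the result matches `predict()` up to rounding.

```r
# Coefficients copied from the summary of cars.Model.T above
b0     <-  2.018e+00   # intercept
b_cyl4 <-  7.577e-02   # cylinders4 indicator
b_hp   <- -8.508e-04   # horsepower slope
b_wt   <- -6.122e-05   # weight slope
yhat <- (b0 + b_cyl4) + b_hp * 100 + b_wt * 3000
yhat  # approximately 1.825, matching predict() below
```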
cars.Model.Int <- lm(mpgTrans ~ cylinders + horsepower + weight + cylinders*horsepower + cylinders*weight, data = cars)
summary(cars.Model.Int)
##
## Call:
## lm(formula = mpgTrans ~ cylinders + horsepower + weight + cylinders *
## horsepower + cylinders * weight, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.12550 -0.02515 0.00000 0.02787 0.17560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.418319 10.587892 -0.323 0.747
## cylinders4 5.521017 10.587914 0.521 0.602
## cylinders5 5.940183 10.594256 0.561 0.575
## cylinders6 5.429651 10.588039 0.513 0.608
## cylinders8 5.461130 10.587969 0.516 0.606
## horsepower 0.263444 0.562359 0.468 0.640
## weight -0.008731 0.018858 -0.463 0.644
## cylinders4:horsepower -0.264935 0.562359 -0.471 0.638
## cylinders5:horsepower -0.268299 0.562362 -0.477 0.634
## cylinders6:horsepower -0.263218 0.562359 -0.468 0.640
## cylinders8:horsepower -0.264355 0.562359 -0.470 0.639
## cylinders4:weight 0.008688 0.018858 0.461 0.645
## cylinders5:weight 0.008648 0.018858 0.459 0.647
## cylinders6:weight 0.008650 0.018858 0.459 0.647
## cylinders8:weight 0.008677 0.018858 0.460 0.646
##
## Residual standard error: 0.04373 on 377 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.8337, Adjusted R-squared: 0.8276
## F-statistic: 135 on 14 and 377 DF, p-value: < 2.2e-16
\[ \hat{Y_i} = (-3.418 + 5.521) + (0.263 - 0.265)X_{hp} + (-0.009 + 0.009)X_{wt} \]
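Summing the base slopes with the `cylinders4` interaction coefficients reported above shows that the effective slopes for a 4-cylinder car are small and negative, despite the large individual coefficients:

```r
# Effective slopes for a 4-cylinder car = base slope + cylinders4 interaction
hp_slope <- 0.263444 + (-0.264935)  # = -0.001491
wt_slope <- -0.008731 + 0.008688    # = -0.000043
round(c(hp = hp_slope, wt = wt_slope), 6)
```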
anova(cars.Model.Int, cars.Model.T)
From the anova output, we reject \(H_0\) and conclude that the full model with interactions provides a significantly better fit.
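The partial F statistic behind `anova()` can also be recovered from the reported \(R^2\) values and residual degrees of freedom of the two models, as a sanity check:

```r
# Partial F test computed from the two summaries above:
# F = ((R2_full - R2_red) / extra params) / ((1 - R2_full) / df_full)
r2_full <- 0.8337; df_full <- 377  # interaction model
r2_red  <- 0.8223; df_red  <- 385  # additive model
F_stat <- ((r2_full - r2_red) / (df_red - df_full)) /
          ((1 - r2_full) / df_full)
F_stat
pf(F_stat, df_red - df_full, df_full, lower.tail = FALSE)  # p-value
```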
# For non interaction model
predict(cars.Model.T, data.frame(cylinders = "4", horsepower = 100, weight = 3000), interval = "prediction", level = 0.95)
## fit lwr upr
## 1 1.824593 1.736006 1.913179
# With interaction model
predict(cars.Model.Int, data.frame(cylinders = "4", horsepower = 100, weight = 3000), interval = "prediction", level = 0.95)
## fit lwr upr
## 1 1.822971 1.73587 1.910073
We find that the predictions from the two models are almost identical.
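These intervals are on the transformed scale; they can be mapped back to mpg with the inverse Box-Cox transform. A sketch using the non-interaction interval above (the `lam = -0.4` here is an illustrative placeholder, not the lambda actually selected by `boxcox()`):

```r
# Back-transform interval endpoints: y = (lam * z + 1)^(1/lam)
lam <- -0.4  # placeholder; substitute bc$x[which.max(bc$y)]
inv_bc <- function(z, lam) (lam * z + 1)^(1 / lam)
inv_bc(c(fit = 1.8246, lwr = 1.7360, upr = 1.9132), lam)
```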